Use this template to complete your project throughout the course. Your Final Project presentation in class will be based on the contents of this document. Replace the title/name and text below with your own, but keep the headers.
The goal of my project is to explore the potential relationship between alcohol consumption and geographic/economic causes, such as median household income. I use the dataset from GHDx named United States Alcohol Use Prevalence by County 2002-2012 to get information of the alcohol pattern across the U.S. (by year and sex). Additionally, I use a map shapefile from the ZevRoss to create map plots, and data from the American Community Survey (ACS) to get the information of median household income.
Faculty/Staff: Dr. Bomyi Lim, Sherrie Xie (Graduate student), Dr. Blanca Himes Dr. Bomyi Lim: helped shape my idea. Sherrie Xie (Ph.D. student): helped with maps and the correlation test Dr. Blanca Himes: helped with maps and notified some limitations of my study.
Alcohol consumption is the 7th leading risk factor for both death and disability all over the world. It led to 2.8 million deaths in 2016. Although several previous research suggests that moderate level of alcohol consumption brings benefit to humans, this finding has resently been questioned. As a matter of fact, a new study published in 2018 (Griswold et al. 2018) states that the level of alcohol consumption which minimize harm on human health is zero. Additionally, negative effect of alcohol on health outcomes raise with the increase in the level of consumption. Therefore, while it is impossible to fully stop populational alcohol use, analyzing factors which can potentially affect alcohol consumption, such as geographic or economic factors, helps develop preventive strategies.
+I use the dataset from GHDx named United States Alcohol Use Prevalence by County 2002-2012 to get information of the alcohol pattern across the U.S. (by year and sex). Additionally, I use a map shapefile from the US Census Bureau to create map plots, and data from the American Community Survey (ACS) to get the information of median income/unemployment/poverty rate.
In the first paragraph, describe the data used and general methodological approach. Subsequently, incorporate full R code necessary to retrieve and clean data, and perform analysis. Be sure to include a description of code so that others (including your future self) can understand what you are doing and why.
##
## The downloaded binary packages are in
## /var/folders/6c/wrs0172s3dzg403t4945383w0000gn/T//RtmpWOdNs9/downloaded_packages
##
## The downloaded binary packages are in
## /var/folders/6c/wrs0172s3dzg403t4945383w0000gn/T//RtmpWOdNs9/downloaded_packages
library(ggplot2) # For plotting
library(tidycensus) # For downloading Census data
library(tmap) # For creating tmap
library(tmaptools) # For reading and processing spatial data related to tmap
library(dplyr) # For data wrangling##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ tibble 2.1.3 ✔ purrr 0.3.3
## ✔ tidyr 1.0.0 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## Linking to GEOS 3.7.2, GDAL 2.4.2, PROJ 5.2.0
## Loading required package: sp
## Checking rgeos availability: TRUE
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
Import and clean up alcohol consumption data preliminarily
#load data
alcohol_any <- read_excel("alcohol_2002_2012.xlsx", sheet = "Any")
alcohol_heavy <- read_excel("alcohol_2002_2012.xlsx", sheet = "Heavy")
alcohol_binge <- read_excel("alcohol_2002_2012.xlsx", sheet = "Binge")
str(alcohol_any)## Classes 'tbl_df', 'tbl' and 'data.frame': 3179 obs. of 41 variables:
## $ State : chr "National" "Alabama" "Alabama" "Alabama" ...
## $ Location : chr "United States" "Alabama" "Autauga County" "Baldwin County" ...
## $ 2002 Both Sexes : num 55.4 40.7 39.4 54 36 31.1 31.9 34.9 34.5 38 ...
## $ 2002 Females : num 47.5 32.1 29.4 45.7 27.3 21.7 22.1 26.5 25.8 27.9 ...
## $ 2002 Males : num 63.7 49.6 49.7 62.5 45 40.7 42.1 43.6 43.4 48.4 ...
## $ 2003 Both Sexes : num 56.6 42.3 40.6 54.9 37.9 33.3 34.1 37 35.5 39.5 ...
## $ 2003 Females : num 48.9 33.8 31 47 29.2 24 24 28.3 26.9 29.3 ...
## $ 2003 Males : num 64.6 51.2 50.6 63.2 46.9 42.9 44.5 46.1 44.5 50.1 ...
## $ 2004 Both Sexes : num 55.2 41 39.2 53.1 35.8 32.3 33.1 34.5 33.8 38.4 ...
## $ 2004 Females : num 47.6 32.9 30.1 45.2 27.4 23.5 23.5 26.2 25.6 28.7 ...
## $ 2004 Males : num 63.2 49.5 48.6 61.3 44.5 41.5 43.1 43.2 42.3 48.3 ...
## $ 2005 Both Sexes : num 56 42.3 40.7 54.4 37.4 33.9 34.5 34.2 35.2 39.6 ...
## $ 2005 Females : num 48.7 33.8 31.6 46.5 28.7 24.7 24.3 25.5 26.6 29.8 ...
## $ 2005 Males : num 63.5 51.1 50.1 62.7 46.4 43.4 45 43.2 44 49.8 ...
## $ 2006 Both Sexes : num 55.4 40.7 38.8 53.1 34.4 32 32.5 33.9 34 37.8 ...
## $ 2006 Females : num 48.5 33.9 31.7 46.9 27.3 24.6 23.8 26.5 27.2 29.8 ...
## $ 2006 Males : num 62.7 47.9 46.1 59.6 41.7 39.7 41.6 41.6 41.1 46.1 ...
## $ 2007 Both Sexes : num 56 41.6 40.6 54.2 35.5 33.1 33.3 34.2 34.8 38.8 ...
## $ 2007 Females : num 48.9 34.1 33.1 46.9 27.6 25.1 23.8 26 27.3 30.3 ...
## $ 2007 Males : num 63.4 49.4 48.3 61.7 43.7 41.5 43.2 42.6 42.6 47.6 ...
## $ 2008 Both Sexes : num 56.1 42.6 41.2 55.1 37 34.2 34.1 36.5 35.6 38.9 ...
## $ 2008 Females : num 49 35.5 33.7 48.2 29.1 26.7 25.3 28.6 27.8 31.2 ...
## $ 2008 Males : num 63.4 50 48.9 62.3 45.3 42.1 43.2 44.7 43.7 46.9 ...
## $ 2009 Both Sexes : num 56 41.3 40.8 53.6 35.2 33.3 32.3 36.2 33.2 37.3 ...
## $ 2009 Females : num 48.9 33.9 32.7 46 27.2 25.4 23.5 27.7 24.5 29.3 ...
## $ 2009 Males : num 63.3 49 49.2 61.5 43.5 41.5 41.5 45 42.1 45.5 ...
## $ 2010 Both Sexes : num 56.1 42.5 42.5 54.6 38.6 34.4 33.5 37.6 35.2 37.3 ...
## $ 2010 Females : num 49.3 35.3 34.4 47.3 30.2 26.5 24.7 28 26.4 29.7 ...
## $ 2010 Males : num 63.2 50.1 50.9 62.1 47.4 42.6 42.6 47.7 44.4 45.2 ...
## $ 2011 Both Sexes : num 57.1 44.8 43.6 57 39.4 36.7 36 38.9 38.2 39.6 ...
## $ 2011 Females : num 50.6 37.3 35.4 50 31 28.5 27 29.1 28.8 31.6 ...
## $ 2011 Males : num 63.9 52.5 52 64.3 48.2 45.2 45.4 49.1 47.9 47.8 ...
## $ 2012 Both Sexes : num 56 43.6 42.5 55.7 37.6 35.8 34.9 37.1 35.8 38.2 ...
## $ 2012 Females : num 49.1 36.5 34.4 48.8 29.2 28 26.3 27.3 26.6 30.1 ...
## $ 2012 Males : num 63 51 50.9 62.8 46.3 43.9 43.9 47.3 45.3 46.6 ...
## $ Percent Change 2002-2012, Both Sexes: num 0.9 7.2 7.9 3.2 4.6 15.3 9.4 6.4 3.8 0.5 ...
## $ Percent Change 2002-2012, Females : num 3.5 13.5 17.1 6.8 7.2 29.1 19.3 3.3 3 7.8 ...
## $ Percent Change 2002-2012, Males : num -1 2.9 2.3 0.4 3 7.7 4.1 8.3 4.3 -3.8 ...
## $ Percent Change 2005-2012, Both Sexes: num 0 3.1 4.5 2.2 0.7 5.7 1.4 8.6 1.7 -3.6 ...
## $ Percent Change 2005-2012, Females : num 0.9 7.8 8.9 5 1.9 13.6 8.4 7.2 -0.2 1.1 ...
## $ Percent Change 2005-2012, Males : num -0.8 -0.1 1.6 0.1 -0.1 1.1 -2.5 9.5 2.9 -6.5 ...
## Classes 'tbl_df', 'tbl' and 'data.frame': 3179 obs. of 29 variables:
## $ State : chr "National" "Alabama" "Alabama" "Alabama" ...
## $ Location : chr "United States" "Alabama" "Autauga County" "Baldwin County" ...
## $ 2005 Both Sexes : num 7 6 6.2 9 5.4 5.1 4.8 5.3 5.4 4.9 ...
## $ 2005 Females : num 5.2 3.6 3.4 6.7 2.5 2.3 2.1 1.7 2.3 2.7 ...
## $ 2005 Males : num 8.9 8.4 9 11.3 8.3 8.1 7.6 9.1 8.7 7.2 ...
## $ 2006 Both Sexes : num 7.2 5.6 5.6 8.4 5 4.8 4.4 5.2 5.2 4.5 ...
## $ 2006 Females : num 5.5 3.4 3 6.1 2 2 1.9 1.6 2 2.5 ...
## $ 2006 Males : num 8.9 7.9 8.3 10.8 8.1 7.6 7.1 8.9 8.4 6.5 ...
## $ 2007 Both Sexes : num 7.7 5.6 5.6 8.4 5.1 4.7 4.4 5.2 5.2 4.5 ...
## $ 2007 Females : num 5.9 3.5 3.3 6.1 1.9 2.1 1.9 1.5 1.9 2.6 ...
## $ 2007 Males : num 9.5 7.8 8 10.8 8.3 7.5 7 9 8.5 6.4 ...
## $ 2008 Both Sexes : num 7.8 6.2 6 9.1 5.6 5.2 4.8 5.8 5.7 4.9 ...
## $ 2008 Females : num 6.1 3.8 3.5 6.5 2.1 2.3 1.9 1.7 2.1 2.8 ...
## $ 2008 Males : num 9.6 8.6 8.5 11.8 9.3 8.2 7.7 10.1 9.4 7.1 ...
## $ 2009 Both Sexes : num 7.7 6 5.7 8.6 5.5 5 4.6 6.1 5.3 4.8 ...
## $ 2009 Females : num 6.1 3.7 3.4 6 1.8 2.1 1.8 1.8 1.7 2.8 ...
## $ 2009 Males : num 9.3 8.3 8.1 11.3 9.3 7.9 7.4 10.5 9.1 6.8 ...
## $ 2010 Both Sexes : num 7.8 6 5.8 8.7 6.1 5 4.6 6.5 5.5 4.7 ...
## $ 2010 Females : num 6.2 3.9 3.6 6.4 2.3 2.1 1.7 2.1 1.9 2.7 ...
## $ 2010 Males : num 9.4 8.3 8.1 11.1 10 8 7.5 11.1 9.3 6.8 ...
## $ 2011 Both Sexes : num 8.8 7.3 6.9 10.4 7.4 6.1 5.5 7.9 6.9 5.9 ...
## $ 2011 Females : num 7 4.6 4 7.8 2.5 2.6 2.2 2.4 2.5 3.3 ...
## $ 2011 Males : num 10.6 10.1 10 13.1 12.4 9.7 9 13.7 11.5 8.5 ...
## $ 2012 Both Sexes : num 8.2 6.6 6.3 9.3 6.4 5.5 5 7 6 5.3 ...
## $ 2012 Females : num 6.7 4.3 3.8 7.3 2.3 2.5 2 2.3 2.3 3.1 ...
## $ 2012 Males : num 9.9 8.9 8.8 11.4 10.5 8.6 8.1 11.9 9.8 7.6 ...
## $ Percent Change 2005-2012, Both Sexes: num 17.2 9.8 1.7 3.4 18.3 7.2 4.6 31.8 10.5 8.3 ...
## $ Percent Change 2005-2012, Females : num 28.2 19.1 12.2 7.7 -7.6 8.7 -4.3 38.5 1.4 14.6 ...
## $ Percent Change 2005-2012, Males : num 10.5 5.6 -2.4 0.8 26.4 6.7 7.2 30.6 13 5.9 ...
## Classes 'tbl_df', 'tbl' and 'data.frame': 3179 obs. of 41 variables:
## $ State : chr "National" "Alabama" "Alabama" "Alabama" ...
## $ Location : chr "United States" "Alabama" "Autauga County" "Baldwin County" ...
## $ 2002 Both Sexes : num 17.3 13.2 13.6 17.9 12.8 11.3 10.2 12.2 11.9 12.2 ...
## $ 2002 Females : num 10.8 6.3 6.6 9.7 5.4 4.4 4.1 4.4 4.9 5.6 ...
## $ 2002 Males : num 24.1 20.4 20.8 26.4 20.4 18.5 16.5 20.2 19 19.1 ...
## $ 2003 Both Sexes : num 17.8 13.1 13.3 17.4 12.5 11.3 10.4 12.5 11.6 12.1 ...
## $ 2003 Females : num 11.2 6.6 6.9 9.8 5.6 4.8 4.5 4.9 5.1 5.9 ...
## $ 2003 Males : num 24.7 19.8 20 25.3 19.6 18 16.5 20.3 18.4 18.6 ...
## $ 2004 Both Sexes : num 17.1 13.2 13 17.4 12.1 11.4 10.6 12.2 11.7 12.3 ...
## $ 2004 Females : num 10.7 6.7 6.5 9.5 5.2 4.9 4.6 4.5 4.9 6 ...
## $ 2004 Males : num 23.6 20 19.8 25.7 19.4 18.1 16.9 20.2 18.7 18.8 ...
## $ 2005 Both Sexes : num 16.8 12.8 12.8 16.8 11.9 11 10.5 11.2 10.9 11.8 ...
## $ 2005 Females : num 10.6 6.8 6.8 9.6 5.4 5.1 4.8 4.2 4.8 6.1 ...
## $ 2005 Males : num 23.4 19 18.9 24.3 18.6 17.2 16.4 18.5 17.3 17.8 ...
## $ 2006 Both Sexes : num 16.8 12.5 12.1 16.6 10.7 10.5 10 11.5 10.7 11.4 ...
## $ 2006 Females : num 10.8 7.3 7 10.3 5 5.3 4.9 4.8 5.1 6.4 ...
## $ 2006 Males : num 22.9 17.9 17.4 23.1 16.7 15.9 15.4 18.4 16.5 16.6 ...
## $ 2007 Both Sexes : num 17.5 12.9 12.8 16.9 11.1 10.9 10.3 11.8 11 11.7 ...
## $ 2007 Females : num 11.4 7.5 7.5 10.4 5 5.5 5 4.9 5.2 6.6 ...
## $ 2007 Males : num 23.8 18.5 18.2 23.7 17.3 16.4 15.9 19 17.1 16.9 ...
## $ 2008 Both Sexes : num 17.4 13.7 13.1 17.7 12 11.6 10.8 12.8 11.9 12.3 ...
## $ 2008 Females : num 11.4 7.7 7.3 10.5 5.2 5.7 4.8 5.2 5.3 6.6 ...
## $ 2008 Males : num 23.6 19.9 19.2 25.2 19.1 17.8 16.9 20.7 18.8 18.3 ...
## $ 2009 Both Sexes : num 17.6 13.4 13 17 11.5 11.7 10.5 13.2 11.2 12 ...
## $ 2009 Females : num 11.5 7.6 7.4 10 5 5.6 4.6 5.4 4.8 6.4 ...
## $ 2009 Males : num 23.8 19.4 18.8 24.2 18.3 17.9 16.6 21.2 17.9 17.8 ...
## $ 2010 Both Sexes : num 17.8 13.4 13.3 16.7 12.7 11.4 10.2 14 11.8 11.6 ...
## $ 2010 Females : num 11.8 7.8 7.9 9.9 5.9 5.5 4.4 5.9 5.4 6.3 ...
## $ 2010 Males : num 24 19.1 18.8 23.7 19.8 17.5 16.2 22.4 18.3 17.2 ...
## $ 2011 Both Sexes : num 19 14.7 14.4 18.6 13.5 12.4 11.3 15.4 13.3 13 ...
## $ 2011 Females : num 12.8 8.9 8.8 11.7 6.3 6.3 5.2 6.8 6.4 7.3 ...
## $ 2011 Males : num 25.6 20.7 20.1 25.8 21 18.8 17.6 24.4 20.4 18.8 ...
## $ 2012 Both Sexes : num 18.3 13.4 13.2 16.9 12.4 11.4 10.3 14.1 12 11.8 ...
## $ 2012 Females : num 12.4 7.9 7.9 10.4 5.4 5.7 4.6 5.9 5.5 6.4 ...
## $ 2012 Males : num 24.5 19.1 18.7 23.7 19.6 17.4 16.2 22.7 18.7 17.5 ...
## $ Percent Change 2002-2012, Both Sexes: num 5.8 1.2 -2.7 -5.7 -3.1 0.9 0.9 16.2 1.2 -3.2 ...
## $ Percent Change 2002-2012, Females : num 14.8 24.9 20.4 6.4 0.8 29.3 11.1 33.2 11.7 14.2 ...
## $ Percent Change 2002-2012, Males : num 1.6 -6.4 -10.3 -10.3 -4.1 -6.1 -1.8 12.4 -1.6 -8.5 ...
## $ Percent Change 2005-2012, Both Sexes: num 8.9 4.6 3.4 0.4 4.1 3.3 -2 25.9 10 0.2 ...
## $ Percent Change 2005-2012, Females : num 17.5 15.2 16 7.7 0.4 11.5 -4.6 39.4 15.8 5.2 ...
## $ Percent Change 2005-2012, Males : num 4.9 0.7 -1.3 -2.7 5.2 0.8 -1.2 22.7 8.3 -1.6 ...
#merge dataframe (alcohol_heavy lack the data from 2002-2004)
alcohol_any <- mutate(alcohol_any, Degree = "Any")
alcohol_binge <- mutate(alcohol_binge, Degree = "Binge")
alcohol_heavy <- mutate(alcohol_heavy, Degree = "Heavy", "2002 Both Sexes" = NA, "2003 Both Sexes" = NA, "2004 Both Sexes" = NA, "2002 Males" = NA, "2003 Males" = NA, "2004 Males" = NA, "2002 Females" = NA, "2003 Females" = NA, "2004 Females" = NA, "Percent Change 2002-2012, Both Sexes" = NA, "Percent Change 2002-2012, Females" = NA, "Percent Change 2002-2012, Males" = NA)
combine_alcohol <- rbind(alcohol_any, alcohol_heavy, alcohol_binge)First, I compare national change of alcohol use (by sexes)
#compare national change of alcohol consumption by sexes (2002 vs. 2012, any vs. binge (2002-2004 heavy data is missing))
#extract data
c_ten <- c(36, 37, 38, 42)
ten_years <- combine_alcohol%>%
filter(State == "National")%>%
filter(Degree == "Any"|Degree =="Binge")
#clean up
ten_years <- ten_years[c_ten] %>%
rename(Both = "Percent Change 2002-2012, Both Sexes", Males = "Percent Change 2002-2012, Males", Females = "Percent Change 2002-2012, Females")
ten_years <- melt(ten_years, id.vars = "Degree")
#plot
ggplot(ten_years, aes(x = Degree, y = value, fill = variable)) +
geom_col(stat = "identity", position = "dodge") +
scale_fill_manual("Sexes", values = c("Both" = "green3", "Females" = "coral1", "Males" = "cornflowerblue")) +
ylab("Change between 2002 and 2012 (%)")## Warning: Ignoring unknown parameters: stat
#compare national change of alcohol consumption by sexes (2005 vs. 2012, any vs. heavy vs. binge)
#extract data
eight_years <- combine_alcohol%>%
filter(State == "National")
#clean up
eight_years <- eight_years[39:42] %>%
rename(Both = "Percent Change 2005-2012, Both Sexes", Males = "Percent Change 2005-2012, Males", Females = "Percent Change 2005-2012, Females")
eight_years <- melt(eight_years, id.vars = "Degree")
#plot
ggplot(eight_years, aes(x = Degree, y = value, fill = variable)) +
geom_col(stat = "identity", position = "dodge") +
scale_fill_manual("Sexes", values = c("Both" = "green3", "Females" = "coral1", "Males" = "cornflowerblue")) +
ylab("Change between 2005 and 2012 (%)")## Warning: Ignoring unknown parameters: stat
Given that the data of heavy alcohol use is not available from 2002-2004, I use data from 2005-2012 in later analyzation. The bar plots show that the change in alcohol prevalence differs by sex and time (2005 vs. 2012). Therefore, I make a box plot to analyze this difference in 2005 and 2012.
#clean up data: heavy 2005 vs. 2012 percent change at state level
alcohol_heavy_state_change <- alcohol_heavy%>%
filter(State == Location)%>%
rename(Both = "Percent Change 2005-2012, Both Sexes", Males = "Percent Change 2005-2012, Males", Females = "Percent Change 2005-2012, Females")%>%
select(State, Both, Males, Females)
#transform dataframe for box plot
box_heavy_state_change <- alcohol_heavy_state_change%>%
select(Males, Females)
box_heavy_state_change <- melt(box_heavy_state_change)%>%
rename(Sexes = "variable", Prevalence = "value")## No id variables; using all as measure variables
heavy_vs_sex <- glm(Sexes ~ Prevalence, data = box_heavy_state_change, family = binomial())
summary(heavy_vs_sex)##
## Call:
## glm(formula = Sexes ~ Prevalence, family = binomial(), data = box_heavy_state_change)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.59777 -0.80146 -0.01314 0.79425 1.90621
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.72360 0.60452 -4.505 6.63e-06 ***
## Prevalence 0.10526 0.02172 4.847 1.25e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 141.40 on 101 degrees of freedom
## Residual deviance: 100.65 on 100 degrees of freedom
## AIC: 104.65
##
## Number of Fisher Scoring iterations: 5
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -4.0260207 -1.6365349
## Prevalence 0.0666349 0.1523813
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 0.06563821 0.0178452 0.1946534
## Prevalence 1.11099902 1.0689051 1.1646042
#plot
ggplot(data = box_heavy_state_change, aes(x= Sexes, y=Prevalence)) +
geom_boxplot() +
scale_fill_manual(values = c("#ff7256", "#6495ed")) +
ylab("Prevalence Change (%)") +
xlab("Gender") +
ggtitle("Heavy Drinking Prevalence Change, Male vs. Female") +
theme(plot.title = element_text(hjust = 0.5))#clean up data: binge 2005 vs. 2012 percent change at state level
alcohol_binge_state_change <- alcohol_binge%>%
filter(State == Location)%>%
rename(Both = "Percent Change 2005-2012, Both Sexes", Males = "Percent Change 2005-2012, Males", Females = "Percent Change 2005-2012, Females")%>%
select(State, Both, Males, Females)
#transform dataframe for box plot
box_binge_state_change <- alcohol_binge_state_change%>%
select(Males, Females)
box_binge_state_change <- melt(box_binge_state_change)%>%
rename(Sexes = "variable", Prevalence = "value")## No id variables; using all as measure variables
binge_vs_sex <- glm(Sexes ~ Prevalence, data = box_binge_state_change, family = binomial())
summary(binge_vs_sex)##
## Call:
## glm(formula = Sexes ~ Prevalence, family = binomial(), data = box_binge_state_change)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8983 -0.6573 -0.0909 0.6486 3.2942
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.49684 0.54335 -4.595 4.32e-06 ***
## Prevalence 0.19895 0.03906 5.093 3.53e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 141.402 on 101 degrees of freedom
## Residual deviance: 93.338 on 100 degrees of freedom
## AIC: 97.338
##
## Number of Fisher Scoring iterations: 5
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -3.6703990 -1.5204899
## Prevalence 0.1296417 0.2841977
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 0.08234443 0.02546631 0.2186048
## Prevalence 1.22011564 1.13842038 1.3286956
#plot
ggplot(data = box_binge_state_change, aes(x= Sexes, y=Prevalence)) +
geom_boxplot() +
ylab("Prevalence Change (%)") +
xlab("Gender") +
ggtitle("Heavy Drinking Prevalence Change, Male vs. Female") +
theme(plot.title = element_text(hjust = 0.5))To analyze whether geographic location (different states) has impact on alcohol consumption level, I make 2X3 (heavy/binge vs. both/females/males) maps to visualize geographic impact
#import US map shapefile
US_sf <- st_read("acs_2012_2016_county_us_B27001/acs_2012_2016_county_us_B27001.shp",
stringsAsFactors = FALSE) %>%
select(GEOID, geometry)%>%
mutate(Code = stringr::str_sub(GEOID, 1, 2))## Reading layer `acs_2012_2016_county_us_B27001' from data source `/Users/slin/BMIN503_Final_Project/acs_2012_2016_county_us_B27001/acs_2012_2016_county_us_B27001.shp' using driver `ESRI Shapefile'
## Simple feature collection with 3141 features and 5 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -2031905 ymin: -2427680 xmax: 2516374 ymax: 732103.3
## epsg (SRID): NA
## proj4string: +proj=laea +lat_0=45 +lon_0=-100 +x_0=0 +y_0=0 +a=6370997 +b=6370997 +units=m +no_defs
#See the changes at state level
#import state code, clean up
State_code <- read_excel("State Code.xlsx")
Code.update <- State_code$Code
Code.new <- str_pad(Code.update, 2, pad = "0")
State_code <- State_code%>%
mutate(Code = as.character(Code.new))
#join state code to alcohol_heavy_state_change
alcohol_heavy_state_change <- inner_join(alcohol_heavy_state_change, State_code, by = "State")
#join state code to alcohol_binge_state_change
alcohol_binge_state_change <- inner_join(alcohol_binge_state_change, State_code, by = "State")
#aggregate map information by state, join to cleaned heavy/binge data
US_states <- US_sf%>%
dplyr::select(geometry)%>%
aggregate(by = list(US_sf$Code), FUN = mean)%>%
rename(Code = "Group.1")%>%
mutate(Code = as.character(Code))
US_states_heavy <- inner_join(US_states, alcohol_heavy_state_change, by = "Code")
US_states_binge <- inner_join(US_states, alcohol_binge_state_change, by = "Code")
#plot
#set min&max for heavy
prev_h_both_min <- sort(US_states_heavy$Both)[1]
prev_h_both_max <- sort(US_states_heavy$Both, decreasing = TRUE)[1]
prev_h_male_min <- sort(US_states_heavy$Males)[1]
prev_h_male_max <- sort(US_states_heavy$Males, decreasing = TRUE)[1]
prev_h_female_min <- sort(US_states_heavy$Females)[1]
prev_h_female_max <- sort(US_states_heavy$Females, decreasing = TRUE)[1]
#set min&max for binge
prev_b_both_min <- sort(US_states_binge$Both)[1]
prev_b_both_max <- sort(US_states_binge$Both, decreasing = TRUE)[1]
prev_b_male_min <- sort(US_states_binge$Males)[1]
prev_b_male_max <- sort(US_states_binge$Males, decreasing = TRUE)[1]
prev_b_female_min <- sort(US_states_binge$Females)[1]
prev_b_female_max <- sort(US_states_binge$Females, decreasing = TRUE)[1]
#set theme
my_theme <- function() {
theme_minimal() +
theme(axis.line = element_blank(),
axis.text = element_blank(),
axis.title = element_blank(),
panel.grid = element_line(color = "white"),
legend.key.size = unit(0.6, "cm"),
legend.text = element_text(size = 12),
legend.title = element_text(size = 12),
plot.title = element_text(size = 16))
}
#set color for both, males, and females
myPalette_both <- colorRampPalette(brewer.pal(9, "Greens"))
myPalette_male <- colorRampPalette(brewer.pal(9, "Blues"))
myPalette_female <- colorRampPalette(brewer.pal(9, "Reds"))
#plot:heavy&binge, both
ggplot() +
geom_sf(data = US_states_heavy, aes(fill = Both), lwd = 0) +
my_theme() +
ggtitle("Change in Age-standardized Prevalence of Heavy Drinking\n 2005 to 2012, Both Sexes") +
scale_fill_gradientn(name = "Prevalence\nChange (%)", colours = myPalette_both(100),
limit = range(prev_h_both_min, prev_h_both_max)) ggplot() +
geom_sf(data = US_states_binge, aes(fill = Both), lwd = 0) +
my_theme() +
ggtitle("Change in Age-standardized Prevalence of Binge Drinking\n 2005 to 2012, Both Sexes") +
scale_fill_gradientn(name = "Prevalence\nChange (%)", colours = myPalette_both(100),
limit = range(prev_b_both_min, prev_b_both_max)) #plot:heavy&binge, males
ggplot() +
geom_sf(data = US_states_heavy, aes(fill = Males), lwd = 0) +
my_theme() +
ggtitle("Change in Age-standardized Prevalence of Heavy Drinking\n 2005 to 2012, Males") +
scale_fill_gradientn(name = "Prevalence\nChange (%)", colours = myPalette_male(100),
limit = range(prev_h_male_min, prev_h_male_max)) ggplot() +
geom_sf(data = US_states_binge, aes(fill = Males), lwd = 0) +
my_theme() +
ggtitle("Change in Age-standardized Prevalence of Binge Drinking\n 2005 to 2012, Males") +
scale_fill_gradientn(name = "Prevalence\nChange (%)", colours = myPalette_male(100),
limit = range(prev_b_male_min, prev_b_male_max)) #plot:heavy&binge, females
ggplot() +
geom_sf(data = US_states_heavy, aes(fill = Females), lwd = 0) +
my_theme() +
ggtitle("Change in Age-standardized Prevalence of Heavy Drinking\n 2005 to 2012, Females") +
scale_fill_gradientn(name = "Prevalence\nChange (%)", colours = myPalette_female(100),
limit = range(prev_h_female_min, prev_h_female_max)) ggplot() +
geom_sf(data = US_states_binge, aes(fill = Females), lwd = 0) +
my_theme() +
ggtitle("Change in Age-standardized Prevalence of Binge Drinking\n 2005 to 2012, Females") +
scale_fill_gradientn(name = "Prevalence\nChange (%)", colours = myPalette_female(100),
limit = range(prev_b_female_min, prev_b_female_max)) Also, given that there is a large gap between 2002/2005 vs. 2012 alcohol consomption in heavy & binge, I check how they change over the ten years
#get national alcohol use values (2002-2012)
changes_national <- combine_alcohol%>%
filter(State == "National")
changes_national <- changes_national[3:35]
#clean up
any_changes <- as.data.frame(matrix(as.numeric(changes_national[1,]), nrow = 3, ncol = 11))
heavy_changes <- as.data.frame(matrix(as.numeric(changes_national[2,]), nrow = 3, ncol = 11))
binge_changes <- as.data.frame(matrix(as.numeric(changes_national[3,]), nrow = 3, ncol = 11))
combine_changes_national <- rbind(any_changes, heavy_changes, binge_changes)
combine_changes_national$Degree <-c(rep("Any", 3), rep("Heavy", 3), rep("Binge",3))
combine_changes_national <- combine_changes_national%>%
rename("2002" = "V1", "2003" = "V2", "2004" = "V3", "2005" = "V4", "2006" = "V5", "2007" = "V6", "2008" = "V7", "2009" = "V8", "2010" = "V9", "2011" = "V10", "2012" = "V11")
#Plot change in heavy alcohol use from 2002 to 2012
#clean up
combine_changes_national <- melt(combine_changes_national, id.vars = "Degree")%>%
rename("Year" = "variable")
combine_changes_national$Sexes <- rep(c("Both", "Females", "Males"), 33)
combine_changes_national_heavy <- combine_changes_national%>%
filter(Degree == "Heavy")
#can use drop_na() to drop N/A data from 2002-2005
#plot
ggplot(combine_changes_national_heavy, aes(x = Year, y = value, colour = Sexes)) +
geom_line(aes(x = Year, y = value, group = Sexes)) +
scale_color_manual("Sexes", values = c("Both" = "green3", "Females" = "coral1", "Males" = "cornflowerblue")) +
ylab("Change in heavy alcohol use from 2002 to 2012 (%)")## Warning: Removed 9 rows containing missing values (geom_path).
#Change in binge alcohol use from 2002 to 2012 (%)
#clean up
combine_changes_national_binge <- combine_changes_national%>%
filter(Degree == "Binge")
#plot
ggplot(combine_changes_national_binge, aes(x = Year, y = value, colour = Sexes)) +
geom_line(aes(x = Year, y = value, group = Sexes)) +
scale_color_manual("Sexes", values = c("Both" = "green3", "Females" = "coral1", "Males" = "cornflowerblue")) +
ylab("Change in binge alcohol use from 2002 to 2012 (%)")Census article–>median household income
## Your original .Renviron will be backed up and stored in your R HOME directory if needed.
## Your API key has been stored in your .Renviron and can be accessed by Sys.getenv("CENSUS_API_KEY").
## To use now, restart R or run `readRenviron("~/.Renviron")`
## [1] "d38c26d19748e1f9eaaa8bde2d2fe0c6b6b0d606"
#import state longitudes and latitudes
states <- read.csv("States.csv")
#change the coordinates of AK and HI to following: AK:27.4,-111.9; HI:25.4,-103.3
#for the purpose to plot maps with translocated AK and HI
states[1,2]<-27.4
states[1,3]<--111.9
states[12,2]<-25.4
states[12,3]<--103.3
states <- states%>%
rename(State = "name")
#find the code for median household income in the past 12 months: B19013_001
variables <- load_variables(2011, "acs5", cache = TRUE)
#2010 median household income
acs.data.2010 <- get_acs(geography = "state",
year = 2010,
variables = c("B19013_001"))## Getting data from the 2006-2010 5-year ACS
#2011 median household income
acs.data.2011 <- get_acs(geography = "state",
year = 2011,
variables = c("B19013_001"))## Getting data from the 2007-2011 5-year ACS
#clean up median household income data
acs.data.2010 <- acs.data.2010%>%
rename(Code = "GEOID")%>%
select(-c(variable, moe))
acs.data.2011 <- acs.data.2011%>%
rename(Code = "GEOID")%>%
select(-c(variable, moe))
#clean up data of alcohol consumption in 2010&2011
#2010
heavy.2010 <- alcohol_heavy%>%
filter(State == Location)%>%
rename(Heavy = "2010 Both Sexes")%>%
select(State, Heavy)
binge.2010 <- alcohol_binge%>%
filter(State == Location)%>%
rename(Binge = "2010 Both Sexes")%>%
select(State, Binge)
heavy.binge.2010 <- inner_join(heavy.2010, binge.2010, by = "State")
heavy.binge.2010 <- inner_join(heavy.binge.2010, State_code, by = "State")
heavy.binge.2010 <- inner_join(heavy.binge.2010, states, by = "State")## Warning: Column `State` joining character vector and factor, coercing into character vector
heavy.binge.2010 <- st_as_sf(heavy.binge.2010, coords = c("longitude", "latitude"), crs = 4267)
estimate.2010 <- inner_join(US_states, acs.data.2010, by = "Code")
#2011
heavy.2011 <- alcohol_heavy%>%
filter(State == Location)%>%
rename(Heavy = "2011 Both Sexes")%>%
select(State, Heavy)
binge.2011 <- alcohol_binge%>%
filter(State == Location)%>%
rename(Binge = "2011 Both Sexes")%>%
select(State, Binge)
heavy.binge.2011 <- inner_join(heavy.2011, binge.2011, by = "State")
heavy.binge.2011 <- inner_join(heavy.binge.2011, State_code, by = "State")
heavy.binge.2011 <- inner_join(heavy.binge.2011, states, by = "State")## Warning: Column `State` joining character vector and factor, coercing into character vector
heavy.binge.2011 <- st_as_sf(heavy.binge.2011, coords = c("longitude", "latitude"), crs = 4267)
estimate.2011 <- inner_join(US_states, acs.data.2011, by = "Code")
#plot: 2010, median household income + heavy/binge
ggplot() +
geom_sf(data = estimate.2010, aes(fill = estimate, geometry=geometry), lwd = 0) +
scale_fill_gradientn(name = "Median\nHousehold\nIncome($)", # change legend title
colours = myPalette_both(100), limit = range(35000, 75000))+
geom_sf(data = heavy.binge.2010, aes(size = Heavy), color = "orange", alpha = 0.6, show.legend = "point") +
scale_size(range = c(1, 5)) +
my_theme() +
ggtitle("Comparison between Median Household Income and\n Binge Drinking, 2010")+
labs(size = "Heavy\nDrinking\nPrevalence(%)")ggplot() +
geom_sf(data = estimate.2010, aes(fill = estimate, geometry=geometry), lwd = 0) +
scale_fill_gradientn(name = "Median\nHousehold\nIncome($)", # change legend title
colours = myPalette_both(100), limit = range(35000, 75000))+
geom_sf(data = heavy.binge.2010, aes(size = Binge), color = "orange", alpha = 0.6, show.legend = "point") +
scale_size(range = c(1, 5)) +
my_theme() +
ggtitle("Comparison between Median Household Income and\n Binge Drinking, 2010")+
labs(size = "Binge\nDrinking\nPrevalence(%)")#plot: 2011, median household income + heavy/binge
ggplot() +
geom_sf(data = estimate.2011, aes(fill = estimate, geometry=geometry), lwd = 0) +
scale_fill_gradientn(name = "Median\nHousehold\nIncome($)", # change legend title
colours = myPalette_both(100), limit = range(35000, 75000))+
geom_sf(data = heavy.binge.2011, aes(size = Heavy), color = "orange", alpha = 0.6, show.legend = "point") +
scale_size(range = c(1, 5)) +
my_theme() +
ggtitle("Comparison between Median Household Income and\n Binge Drinking, 2011")+
labs(size = "Heavy\nDrinking\nPrevalence(%)")ggplot() +
geom_sf(data = estimate.2011, aes(fill = estimate, geometry=geometry), lwd = 0) +
scale_fill_gradientn(name = "Median\nHousehold\nIncome($)", # change legend title
colours = myPalette_both(100), limit = range(35000, 75000))+
geom_sf(data = heavy.binge.2011, aes(size = Binge), color = "orange", alpha = 0.6, show.legend = "point") +
scale_size(range = c(1, 5)) +
my_theme() +
ggtitle("Comparison between Median Household Income and\n Binge Drinking, 2011")+
labs(size = "Binge\nDrinking\nPrevalence(%)")Correlation
#2010
corr.drink.income.2010 <- inner_join(heavy.binge.2010, acs.data.2010, by = "Code")
corr.h.2010 <- corr.drink.income.2010$Heavy
corr.b.2010 <- corr.drink.income.2010$Binge
corr.i.2010 <-corr.drink.income.2010$estimate
cor.test(x = corr.h.2010, y = corr.i.2010, method = "spearman", exact = FALSE)##
## Spearman's rank correlation rho
##
## data: corr.h.2010 and corr.i.2010
## S = 13213, p-value = 0.009049
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.365515
##
## Spearman's rank correlation rho
##
## data: corr.b.2010 and corr.i.2010
## S = 12906, p-value = 0.006451
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.3802523
ggplot(corr.drink.income.2010, aes(estimate, Heavy))+
geom_point()+
geom_smooth(color = "red", method = "lm")+
xlab("2010 Median Household Income ($)")+
ylab("2010 Heavy Drinking Prevalence (%)")ggplot(corr.drink.income.2010, aes(estimate, Binge))+
geom_point()+
geom_smooth(color = "red", method = "lm")+
xlab("2010 Median Household Income ($)")+
ylab("2010 Binge Drinking Prevalence (%)")#2011
corr.drink.income.2011 <- inner_join(heavy.binge.2011, acs.data.2011, by = "Code")
corr.h.2011 <- corr.drink.income.2011$Heavy
corr.b.2011 <- corr.drink.income.2011$Binge
corr.i.2011 <-corr.drink.income.2011$estimate
cor.test(x = corr.h.2011, y = corr.i.2011, method = "spearman", exact = FALSE)##
## Spearman's rank correlation rho
##
## data: corr.h.2011 and corr.i.2011
## S = 13917, p-value = 0.0186
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.3317397
##
## Spearman's rank correlation rho
##
## data: corr.b.2011 and corr.i.2011
## S = 11855, p-value = 0.001792
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.4307471
ggplot(corr.drink.income.2011, aes(estimate, Heavy))+
geom_point()+
geom_smooth(color = "red", method = "lm")+
xlab("2011 Median Household Income ($)")+
ylab("2011 Heavy Drinking Prevalence (%)")ggplot(corr.drink.income.2011, aes(estimate, Binge))+
geom_point()+
geom_smooth(color = "red", method = "lm")+
xlab("2011 Median Household Income ($)")+
ylab("2011 Binge Drinking Prevalence (%)")#see whether different
corr.drink.income.2010 <- corr.drink.income.2010%>%
mutate(Year = "2010")
corr.drink.income.2011 <- corr.drink.income.2011%>%
mutate(Year = "2011")
alcohol.estimate.combine <- rbind(corr.drink.income.2010, corr.drink.income.2011)
alcohol.estimate.combine %>%
group_by(Year) %>%
ggplot(aes(estimate, Heavy, color = Year)) +
geom_point()+
xlab("Median Household Income ($)")+
ylab("Heavy Drinking Prevalence (%)")alcohol.estimate.combine %>%
group_by(Year) %>%
ggplot(aes(estimate, Binge, color = Year)) +
geom_point()+
xlab("Median Household Income ($)")+
ylab("Binge Drinking Prevalence (%)") Describe your results and include relevant tables, plots, and code/comments used to obtain them. End with a brief conclusion of your findings related to the question you set out to address. You can include references if you’d like, but this is not required.